

Neural Networks CMPUT 366: Intelligent Systems


GBC §6.0-6.4.1

Lecture Outline

1. Recap
2. Nonlinear models
3. Feedforward neural networks




Recap: Calculus

• Partial derivatives are derivatives of a "frozen" function: ∂/∂x f(x, y) = d/dx f(x, y₀), i.e., differentiate with y frozen at a constant y₀
• The gradient of a function is the vector of all its partial derivatives:

  (∇f)(x, y) = [ ∂/∂x f(x, y), ∂/∂y f(x, y) ]

• Derivatives can be used for optimization
• Minimization: Increase x if the derivative is negative & vice versa
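As a minimal sketch of the recap above (not course code), partial derivatives and the gradient can be approximated numerically by freezing all arguments but one and taking a finite difference:

```python
# Central finite differences: approximate each partial derivative by
# perturbing one argument while "freezing" the others.

def partial(f, point, i, eps=1e-6):
    """Approximate the partial derivative of f w.r.t. argument i at `point`."""
    hi, lo = list(point), list(point)
    hi[i] += eps
    lo[i] -= eps
    return (f(*hi) - f(*lo)) / (2 * eps)

def grad(f, point):
    """Gradient: the vector of all partial derivatives."""
    return [partial(f, point, i) for i in range(len(point))]

f = lambda x, y: x**2 + 3 * y
g = grad(f, (2.0, 1.0))  # analytically [2x, 3] = [4, 3]
```

The example function and names here are illustrative, not from the lecture.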



 

Linear Models

• Supervised models we have considered so far have been linear:

  y = f(x; w) = g(wᵀx) = g(∑ᵢ₌₁ⁿ wᵢxᵢ)

  with inputs x, weights w, and activation function g
• Linear classification / regression
• Logistic regression
• Advantages: Efficient to fit (sometimes even in closed form!)
• Disadvantages: Can be really limited


Question: What else could we do?
Example: XOR

• The function f(x1, x2) = (x1 XOR x2) is not linearly separable
• There is no way to draw a straight line with all of the 1's on one side and all of the 0's on the other
• This means that no linear model can represent XOR exactly; there will always be some errors

[Figure 6.1, left (Goodfellow 2017): a linear model applied directly to the original x space cannot implement the XOR function]

Nonlinear Features

• So far: y = f(x; w) = g(wᵀx) = g(∑ᵢ₌₁ⁿ wᵢxᵢ)
• One option: Learn a linear model on richer inputs
  1. Define a feature mapping φ(x) that returns functions of the original inputs
  2. Learn a linear model of the features instead of the inputs:

  y = f(x; w) = g(wᵀφ(x)) = g(∑ᵢ₌₁ⁿ wᵢ[φ(x)]ᵢ)
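The two-step recipe above can be sketched in a few lines (names and the polynomial feature mapping are illustrative assumptions, not from the lecture). The model is linear in the weights w even though it is nonlinear in the input x:

```python
# A linear model over a feature mapping phi(x).

def phi(x):
    # Illustrative feature mapping: polynomial features of a scalar input.
    return [1.0, x, x**2, x**3]

def f(x, w, g=lambda z: z):
    # y = g(w . phi(x)); g is the identity for regression,
    # a sigmoid for logistic regression.
    return g(sum(wi * fi for wi, fi in zip(w, phi(x))))

# With w = [0, 0, 1, 0], this "linear" model computes x**2 exactly.
```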

Nonlinear Features for XOR

• Question: What additional features would help?
• The product of x1 and x2!
• φ(x1, x2) = [1, x1, x2, x1x2]
XOR with Nonlinear Features

• With φ(x1, x2) = [1, x1, x2, x1x2] and weights w = [−0.2, 0.5, 0.5, −2]:
• f(x; w) = wᵀφ(x) > 0 for (0,1) and (1,0)
• f(x; w) = wᵀφ(x) < 0 for (1,1) and (0,0)

[Figure 6.1 (Goodfellow 2017): Solving the XOR problem by learning a representation. Left: in the original x space, a linear model cannot implement XOR. Right: in the learned feature space with h1 = x1 + x2 and h2 = x1x2, the points [1,0]ᵀ and [0,1]ᵀ are mapped to a single point, and a linear model (increasing in h1, decreasing in h2) can solve the problem.]

Learning Nonlinear Features

• Manually constructing good features is extremely hard
• Manually constructed features are not transferable between domains
• e.g., SIFT features were a revolution in computer vision, but are only for computer vision
• Deep learning aims to learn φ automatically from the data
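The XOR feature solution can be checked directly. This sketch uses the feature mapping and weights from the slide (function names are mine):

```python
# Verify the slide's XOR solution: with phi(x1, x2) = [1, x1, x2, x1*x2]
# and w = [-0.2, 0.5, 0.5, -2], the linear model w . phi(x) is positive
# exactly on the XOR-true inputs.

def phi(x1, x2):
    return [1.0, x1, x2, x1 * x2]

w = [-0.2, 0.5, 0.5, -2.0]

def f(x1, x2):
    return sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))

# f(0,1) and f(1,0) are positive; f(0,0) and f(1,1) are negative.
```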
 
Neural Units

• Deep learning learns φ by composing little functions
• These functions are called units:

  h(x; w, b) = g(b + wᵀx) = g(b + ∑ᵢ₌₁ⁿ wᵢxᵢ)

  with offset b, weights w, and activation function g
• Question: How is this different from a linear model?

Feedforward Neural Network

• A neural network is many units composed together
• Feedforward neural network: Units arranged into layers
• Each layer takes the outputs of the previous layer as its inputs

Example: XOR network

• Activation: g(z) = max{0, z} ("rectified linear unit")
• Weights: [+1, −1] for h1; [−1, +1] for h2; [+1, +1] for y

Matrix Representation

• You can think of the outputs of each layer as a vector h
• The weights from all the outputs of a previous layer to each of the units of the layer can be collected into a matrix W
• The offset term for each unit can be collected into a vector b:

  h = g(Wx + b)

Architecture

Design decisions:
1. Depth: number of layers
2. Width: number of nodes in each layer
3. Fully connected?

Universal Approximation Theorem

Theorem (Hornik et al. 1989; Cybenko 1989; Leshno et al. 1993):
A feedforward network with one hidden layer with a "squashing" activation or rectified linear activation and a linear output layer can approximate any function to within any given error bound, given enough hidden units.

• So a wide but shallow feedforward network can represent any function we're trying to learn!
• Question: Why bother with multiple layers? (i.e., depth > 1)
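The XOR network from the example slide can be sketched directly. This is a minimal sketch assuming zero offsets, which makes the listed weights ([+1, −1] for h1, [−1, +1] for h2, [+1, +1] for y) reproduce XOR:

```python
# Two ReLU hidden units followed by a linear output unit.

def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    h1 = relu(+1 * x1 + -1 * x2)   # fires on (1, 0)
    h2 = relu(-1 * x1 + +1 * x2)   # fires on (0, 1)
    return +1 * h1 + +1 * h2       # linear output layer

# xor_net reproduces XOR on all four binary inputs.
```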



Training

• Neural networks are trained using variants of gradient descent, e.g., stochastic gradient descent
• Back propagation is an algorithm that allows for efficient computation of the gradient
• Modern frameworks can compute the gradient in other ways (e.g., automatic differentiation), even for complicated units
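A single gradient-descent step can be sketched for the smallest possible "network", a single unit y = w·x with squared error. This is an illustrative sketch (not course code); the gradient is derived by hand here, where frameworks would use backpropagation or automatic differentiation:

```python
# One stochastic-gradient-descent step on L(w) = (w*x - target)**2.

def sgd_step(w, x, target, lr=0.1):
    y = w * x                      # forward pass
    grad_w = 2 * (y - target) * x  # dL/dw
    return w - lr * grad_w         # move against the gradient

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=1.0, target=3.0)
# w converges toward 3.0
```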


Hidden Unit Activations

• Default choice: Rectified linear units (ReLU): g(z) = max{0, z}
• Other common types:
  • tanh(z)
  • 1 / (1 + e⁻ᶻ) (sigmoid)
• Sigmoid suffers from vanishing gradients; ReLU does not
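The vanishing-gradient claim can be illustrated with a quick sketch (names are mine): the sigmoid's derivative σ(z)(1 − σ(z)) shrinks toward 0 as |z| grows, while ReLU's derivative stays 1 for any z > 0:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z)), at most 0.25.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max{0, z} (taking 0 at z = 0).
    return 1.0 if z > 0 else 0.0

# At z = 10 the sigmoid's gradient is tiny, while ReLU's is still 1.
```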


Summary

• Generalized linear models are insufficiently expressive
• Composing GLMs into a network is arbitrarily expressive
• A neural network with a single hidden layer can approximate any function
• But the network might need to be impractically large, prone to overfitting, or inefficient to train
• Neural networks are trained using variants of gradient descent
• Architectural choices can make a network easier to train and less prone to overfitting